1 Introduction

Mass spectrometry based single-cell proteomics (MS-SCP) allows the identification and quantification of hundreds to thousands of proteins at single cell resolution.

library(scpdata)

2 Specht et al. 2019

Reference: Specht Harrison, Edward Emmott, Toni Koller, and Nikolai Slavov. 2019. “High-Throughput Single-Cell Proteomics Quantifies the Emergence of Macrophage Heterogeneity.” bioRxiv.

Specht and colleagues applied their MS-SCP pipeline, called SCoPE2, to measure the increase in population heterogeneity when monocytes differentiate from macrophages. In the study, they profiled the proteome of 97 monocytes and 259 macrophages at single-cell resolution. They could identify 6787 unique peptides belonging to 2271 unique proteins.

In this section, we will reproduce the results obtained in the paper along with some data exploration. More information about the data set can be found in scpdata/inst/README.md or in the data set help file.

?specht2019

Let’s rename the data to a more convenient variable.

sc <- specht2019

2.1 Data exploration

Expression data is encoded as matrix whith each row is a feature (peptide or protein) and every column is a sample (sing cell). The most straightforward visualization of matrix-like data is the heatmap. The

The peptide expression data is very sparse (white space) what makes finding expression pattern difficult. Let’s dig into the missing data. The histogram below show the distribution of missing data among features (peptides) and among cells. The redline indicates the average missingness.

scp_plotMissing(sc)
## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 2 rows containing missing values (geom_bar).

On average, about 75% of the data is missing ! However, the distributions are drastically different depending if we look at peptide level or at cell level. The distribution of missing data in cells seems very close to a normal distribution. Since 2 cell types are present, the missingness could depend on cell type. This is investigated in the next plot.

There is no obvious relation between the rate of missing data and the cell type. The amount of missing could be influenced by the average peptide intensity (excluding missing values of course).

Again, no correlation between missingness and peptide intensity can be seen. Additional data is required to fully investigate the source of missing data. For instance, the missingness is probably linked with the peptide identification accuracy (FDR, PEP or identification score).

Next, we explore the structure of the peptide expression data. Recall that they were already preprocessed to some extend by the authors (filtered, normalized, log-transformed, see inst/README.md). The following plots show some basic summary statistics.

scp_plotStats(sc, xstat = "mean", ystat = "var")

Variance and average intensity observed among cells seems completely uncorrelated, and the ranges are relatively narrow, indicating that the quantification and identification performance is constant accross cells. On the other hand, there is a linear relationship between the variance and the average intensity for peptides. As the average intensity decreases, the variance increases. Interestingly, this is the inverse of what is expected for the Poisson or negative binomial distributions.

A benchmark measure often used by the Slavov Lab is the protein coefficient of variation (CV). CV is the standard deviation of the peptide intensities belonging to the same protein over their mean intensity. Here are the coefficients of variation for all proteins in every cell.

scp_plotCV(sc)

The CV is roughly constant across cells.

2.2 Normalization

The normalization procedure here follows the normalization procedure from the original paper. The first step is to subtract from each row (peptides) the mean intensity of that row. This is performed using our scp_normalize() function.

scNorm <- scp_normalize(sc, what = "row")

Next, the rows are collapsed by proteins, meaning that measurements for peptides that belong to the same proteins are merged in a single rozw. The rows from different peptides are summarized using the median value of the peptides.

scNorm <- scp_aggregateByProtein(scNorm)

Rows than columns are finally normalized again. For the rows the mean value is subtracted; for the columns the median value is subtracted. Note that the argument "both" implies that rows then colmuns are normalized.

scNorm <- scp_normalize(scNorm, what = "both")

This can be performed in a single run using pipes:

sc %>% scp_normalize(what = "row") %>%
       scp_aggregateByProtein() %>%
       scp_normalize(what = "both") -> scNorm

In the above section, we looked at the mean and standard deviation distributions of the original data set. We can redraw those plots after data normalization.

Since the last step is subracting the median intensity from every cell, it is expected that the median intensity is constant. For individual proteins, the median intensities seem to be normally distributed around 0. It should be noted that the normalization procedure used by Specht and colleagues does not normalize the variance. This should be taken into account when performing downstream analysis such as PCA.

Other imputation methods could be tested. For instance, the Linnorm method is a popular method that has shown good performance on scRNA-Seq data.

2.3 Imputation

we observed in the data exploration section that the specht2019 data set is highly sparse and contains a majority of missing entries. This large proportion of missing data hinders downstream analysis and imputation is performed to fill in the gaps. In their paper, Specht et al. implemented the k-nearest neighbour imputation:

scImput <- imputeKNN(scNorm)
cat("Missing data:", 
    round(sum(is.na(exprs(scImput))) / length(exprs(scImput)) * 100, 2), 
    "%")
## Missing data: 0 %

To asses the performance of this imputation, we have a look again at the data using heatmaps. Before imputation (after normalization), the data is very sparse.

After imputation:

Since the distance measure used for the KNN is Euclidean distance, it is interesting to look at how the similarity between cells is affected by the imputation:

We can see the imputation increases the correlation that exists between 3 groups of cells. The groups are exactly correlated with the batches defined by the combination of the chromatographic batch and another undocumented batch (the batch could be retrieved from the file names…). This strongly suggests that the SCoPE2 pipeline is subject to batch effect that cannot be corrected by the normalization to the reference channel. Therefore, batch correction using ComBat is applied (see below).

Imputation using KNN has been shown to create artifacts when applied on scRNA-Seq data (see Tian et al. (2019)). Better methods could be Drimpute or SAVER.

2.4 Batch correction and data integration

Specht et al. performed batch correction of the single cell data using the ComBat function from the sva package. ComBat is a batch correction method using empirical Bayes (EB) procedure developed by Johnson et al. (2007) to remove variation linked to experimental variables (eg day of processing) while preserving the other sources of variation, namely the biological variation. The method is based on a previous model called the location and scale adjustment (L/S) model.

scBc <- batchCorrect(scImput, batch = "raw.file", target = "celltype")
## Found62batches
## Adjusting for1covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data

Let’s look again at the correlation heatmaps after reordering columns by cell type.

It seems that the batch effects seen in the previous plots are completely removed. ComBat is able to correct for batch effects without affecting the biological variation induced by cell type. Note that 62 batches (that is the 62 SCoPE2 sets) were used for removing the batch effects.

Let’s see if correct clusters can be found.

The clustering (hierarchical clustering with complete linkage and Euclidean distance) almost perfectly groups the cell types together. Some batch effect whithin the cell type groups seems still present as batches are more grouped than expected by chance. This could be either due to remaining technical bias or an artifact of the KNN imputation. However, we can see that the main expression patterns are correlated with cell type and indicate that the data contain sufficient protein expression information to discriminate between the undifferentiated monocytes and the macrophages.

2.5 Conclusion

The data from Specht et al. published in June 2019 is a successful test case for MS-based SCP showing that such a technology can be used for biological purposes. The authors were able with their data to differentiate between macrophages and monocytes. Although the data exhibit high missingness, a clear-cut differences between the cell type could still be observed, and is even more highlighted by PCA.

This allows the identification of protein preferentially expressed in one cell type rather than the other. A list of proteins associated to the patterns (eg associated with the first PC) can easily be extracted and biological inference and interpretation can be performed.

##      protein        svec
## 1917  Q9UBK2 -0.04609896
## 1298  Q13352  0.04595170
## 2131  Q6UX07  0.04579724
## 1359  O94985 -0.04517350
## 860   P32455 -0.04513235
##                                                                     annot
## 1917 Peroxisome proliferator-activated receptor gamma coactivator 1-alpha
## 1298                                                 Centromere protein R
## 2131                         Dehydrogenase/reductase SDR family member 13
## 1359                                                        Calsyntenin-1
## 860                                           Guanylate-binding protein 1

References

Tian, Luyi, Xueyi Dong, Saskia Freytag, Kim-Anh Lê Cao, Shian Su, Abolfazl JalalAbadi, Daniela Amann-Zalcenstein, et al. 2019. “Benchmarking Single Cell RNA-sequencing Analysis Pipelines Using Mixture Control Experiments.” Nat. Methods 16 (6): 479–87.